666 research outputs found

    AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture

    CPU-FPGA heterogeneous architectures are attracting ever-increasing attention in an attempt to advance computational capabilities and energy efficiency in today's datacenters. These architectures provide programmers with the ability to reprogram the FPGAs for flexible acceleration of many workloads. Nonetheless, this advantage is often overshadowed by the poor programmability of FPGAs, whose programming is conventionally an RTL design practice. Although recent advances in high-level synthesis (HLS) significantly improve FPGA programmability, they still leave programmers facing the challenge of identifying the optimal design configuration in a tremendous design space. This paper aims to address this challenge and pave the path from software programs towards high-quality FPGA accelerators. Specifically, we first propose the composable, parallel and pipeline (CPP) microarchitecture as a template of accelerator designs. Such a well-defined template is able to support efficient accelerator designs for a broad class of computation kernels, and more importantly, drastically reduce the design space. Also, we introduce an analytical model to capture the performance and resource trade-offs among different design configurations of the CPP microarchitecture, which lays the foundation for fast design space exploration. On top of the CPP microarchitecture and its analytical model, we develop the AutoAccel framework to make the entire accelerator generation automated. AutoAccel accepts a software program as an input and performs a series of code transformations based on the result of the analytical-model-based design space exploration to construct the desired CPP microarchitecture. Our experiments show that the AutoAccel-generated accelerators outperform their corresponding software implementations by an average of 72x for a broad class of computation kernels.
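
The idea of analytical-model-based design space exploration mentioned in the abstract above can be illustrated with a toy sketch. This is not the AutoAccel model: the function `explore`, its parameters, the 4-cycles-per-item cost for unpipelined lanes, and the DSP budget are all hypothetical, chosen only to show how a closed-form cost estimate lets a tool enumerate configurations without synthesizing each one.

```python
from itertools import product

def explore(work_items, dsp_per_lane, dsp_budget, lanes_options, pipeline_options):
    """Pick the configuration with the fewest estimated cycles under a DSP budget.

    Returns (cycles, dsp, lanes, pipelined), or None if nothing fits.
    The cost model is a made-up placeholder, not taken from the paper.
    """
    best = None
    for lanes, pipelined in product(lanes_options, pipeline_options):
        dsp = lanes * dsp_per_lane
        if dsp > dsp_budget:
            continue  # configuration exceeds the resource budget
        # Hypothetical latency model: a pipelined lane finishes one item
        # per cycle, an unpipelined lane needs 4 cycles per item.
        cycles_per_item = 1 if pipelined else 4
        items_per_lane = -(-work_items // lanes)  # ceiling division
        cycles = items_per_lane * cycles_per_item
        candidate = (cycles, dsp, lanes, pipelined)
        if best is None or candidate < best:
            best = candidate
    return best

best = explore(
    work_items=1024,
    dsp_per_lane=10,
    dsp_budget=100,
    lanes_options=[1, 2, 4, 8, 16],
    pipeline_options=[False, True],
)
print(best)
```

With these made-up numbers the 16-lane options are rejected for exceeding the budget, and the search settles on 8 pipelined lanes. A real model would account for memory bandwidth, initiation intervals and several resource types.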

    Optimal Layout Synthesis for Quantum Computing

    Recent years have witnessed the fast development of quantum computing. Researchers around the world are eager to run larger and larger quantum algorithms that promise speedups impossible for any classical algorithm. However, the available quantum computers are still volatile and error-prone. Thus, layout synthesis, which transforms quantum programs to meet these hardware limitations, is a crucial step in the realization of quantum computing. In this paper, we present two synthesizers, one optimal and one approximate but nearly optimal. Although a few optimal approaches to this problem have been published, our optimal synthesizer explores a larger solution space and is thus optimal in a stronger sense. In addition, it reduces time and space complexity exponentially compared to some leading optimal approaches. The key to this success is a more efficient spacetime-based variable encoding of the layout synthesis problem as a mathematical programming problem. By slightly changing our formulation, we arrive at an approximate synthesizer that is even more efficient and outperforms some leading heuristic approaches, in terms of additional gate cost, by up to 100%, and also fidelity by up to 10x on a comprehensive set of benchmark programs and architectures. For a specific family of quantum programs named QAOA, which is deemed to be a promising application for near-term quantum computers, we further adjust the approximate synthesizer by taking commutation into consideration, achieving up to 75% reduction in depth and up to 65% reduction in additional cost compared to the tool used in a leading QAOA study. Comment: to appear in ICCAD'2

    Optimal Qubit Mapping with Simultaneous Gate Absorption

    Before quantum error correction (QEC) is achieved, quantum computers focus on noisy intermediate-scale quantum (NISQ) applications. Compared to the well-known quantum algorithms requiring QEC, like Shor's or Grover's algorithm, NISQ applications have different structures and properties to exploit in compilation. A key step in compilation is mapping the qubits in the program to physical qubits on a given quantum computer, which has been shown to be an NP-hard problem. In this paper, we present OLSQ-GA, an optimal qubit mapper with a key feature of simultaneous SWAP gate absorption during qubit mapping, which we show to be a very effective optimization technique for NISQ applications. For the class of quantum approximate optimization algorithm (QAOA), an important NISQ application, OLSQ-GA reduces depth by up to 50.0% and SWAP count by 100% compared to other state-of-the-art methods, which translates to 55.9% fidelity improvement. The solution optimality of OLSQ-GA is achieved by the exact SMT formulation. For better scalability, we augment our approach with additional constraints in the form of initial mapping or alternating matching, which speeds up OLSQ-GA by up to 272X with no or little loss of optimality. Comment: 8 pages, 8 figures, to appear in ICCAD'2
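
The qubit mapping step described in the abstract above can be made concrete with a small brute-force sketch. This is not OLSQ-GA and does not use an SMT solver. It only searches for an initial mapping under which every two-qubit gate already acts on adjacent physical qubits, so that no SWAP is needed. The function name `find_swap_free_mapping`, the 4-qubit line device, and the toy gate lists are assumptions made for illustration.

```python
from itertools import permutations

def find_swap_free_mapping(num_logical, gates, coupling_edges, num_physical):
    """Return a tuple m where m[q] is the physical qubit hosting logical
    qubit q, such that every two-qubit gate acts on adjacent physical
    qubits. Returns None if no such mapping exists (SWAPs would be needed).
    Exhaustive search: only feasible for a handful of qubits."""
    adjacent = {frozenset(edge) for edge in coupling_edges}
    for mapping in permutations(range(num_physical), num_logical):
        if all(frozenset((mapping[a], mapping[b])) in adjacent for a, b in gates):
            return mapping
    return None

# Hypothetical device: four physical qubits on a line, 0-1-2-3.
line = [(0, 1), (1, 2), (2, 3)]

# Toy program whose interaction graph is the path 0-2-1-3.
path_gates = [(0, 2), (2, 1), (1, 3)]
mapping = find_swap_free_mapping(4, path_gates, line, 4)
print(mapping)

# Toy program whose interaction graph is a triangle: it cannot be
# embedded in a line, so at least one SWAP would be required.
triangle_gates = [(0, 1), (1, 2), (0, 2)]
print(find_swap_free_mapping(3, triangle_gates, line, 4))
```

The exhaustive search grows factorially with the number of qubits, which is consistent with the NP-hardness noted in the abstract and is why exact approaches encode the problem for a solver instead.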

    Placement-Driven Technology Mapping for LUT-Based FPGAs

    In this paper, we study the problem of placement-driven technology mapping for table-lookup based FPGA architectures to optimize circuit performance. Early work on technology mapping for FPGAs, such as Chortle-d[14] and Flowmap[3], aims to optimize the depth of the mapped solution without consideration of interconnect delay. Later works such as Flowmap-d[7], Bias-Clus[4] and EdgeMap consider interconnect delays during mapping, but do not take into consideration the effects of their mapping solution on the final placement. Our work focuses on the interaction between the mapping and placement stages. First, the interconnect delay information is estimated from the placement, and used during the labeling process. A placement-based mapping solution which considers both global cell congestion and local cell congestion is then developed. Finally, a legalization step and detailed placement are performed to realize the design. We have implemented our algorithm in a LUT-based FPGA technology mapping package named PDM (Placement-Driven Mapping) and tested the implementation on a set of MCNC benchmarks. We use the tool VPR[1][2] for placement and routing of the mapped netlist. Experimental results show the longest path delay on a set of large MCNC benchmarks decreased by 12.3% on average.

    Simultaneous timing driven clustering and placement for FPGAs

    Abstract. Traditional placement algorithms for FPGAs are normally carried out on a fixed clustering solution of a circuit. The impact of clustering on wirelength and delay of the placement solutions is not well quantified. In this paper, we present an algorithm named SCPlace that performs simultaneous clustering and placement to minimize both the total wirelength and longest path delay. We also incorporate a recently proposed path counting-based net weighting schem

    Compiling Quantum Circuits for Dynamically Field-Programmable Neutral Atoms Array Processors

    Dynamically field-programmable qubit arrays (DPQA) have recently emerged as a promising platform for quantum information processing. In DPQA, atomic qubits are selectively loaded into arrays of optical traps that can be reconfigured during the computation itself. Leveraging qubit transport and parallel, entangling quantum operations, different pairs of qubits, even those initially far away, can be entangled at different stages of the quantum program execution. Such reconfigurability and non-local connectivity present new challenges for compilation, especially in the layout synthesis step which places and routes the qubits and schedules the gates. In this paper, we consider a DPQA architecture that contains multiple arrays and supports 2D array movements, representing cutting-edge experimental platforms. Within this architecture, we discretize the state space and formulate layout synthesis as a satisfiability modulo theories problem, which can be solved by existing solvers optimally in terms of circuit depth. For a set of benchmark circuits generated by random graphs with complex connectivities, our compiler OLSQ-DPQA reduces the number of two-qubit entangling gates on small problem instances by 1.7x compared to optimal compilation results on a fixed planar architecture. To further improve scalability and practicality of the method, we introduce a greedy heuristic inspired by the iterative peeling approach in classical integrated circuit routing. Using a hybrid approach that combines the greedy and optimal methods, we demonstrate that our DPQA-based compiled circuits feature reduced scaling overhead compared to a fixed grid architecture, resulting in 5.1X fewer two-qubit gates for 90-qubit quantum circuits. These methods enable programmable, complex quantum circuits with neutral atom quantum computers, as well as informing both future compilers and future hardware choices. Comment: An extended abstract of this work was presented at the 41st International Conference on Computer-Aided Design (ICCAD '22